Code
# load in data
library(tidyverse)
library(gganimate)
library(gifski)
library(broom)
library(kableExtra)For our final project, we chose to focus on CO2 emissions per capita and life expectancy. We use data sets taken from Gapminder, which have been combined and collected from a variety of sources. The co2_emissions data set contains 194 observations (countries) with 224 variables (ranging in year from 1800 to 2022). These data represent the recorded consumption-based CO2 emissions, in tonnes of CO2 per capita. The life_expectancy data set contains 195 observations (countries) with 302 variables (ranging in year from 1800 to 2100). These data represent the number of years a newborn infant would live assuming the mortality rate at their birth remains constant throughout their life.
# load in data
library(tidyverse)
library(gganimate)
library(gifski)
library(broom)
library(kableExtra)# Data
co2_emissions <- read_csv(here::here("co2_pcap_cons.csv"))
life_expectancy <- read_csv(here::here("lex.csv"))
# Pivoting Longer
co2_emissions_long <- co2_emissions |>
mutate(across(`1800`:`2022`, ~ str_replace_all(.x, "−", "-"))) |>
mutate(across(`1800`:`2022`, ~ as.numeric(.x))) |>
pivot_longer(
cols = `1800`:`2022`,
names_to = "year",
values_to = "co2"
)
life_exp_long <- life_expectancy |>
mutate(across(`1800`:`2100`, ~ as.numeric(.x))) |>
pivot_longer(
cols = `1800`:`2100`,
names_to = "year",
values_to = "life_exp"
)
# Join Datasets
project_data <- co2_emissions_long |>
inner_join(life_exp_long,
by = c("country", "year")
) |>
mutate(year = as.numeric(year))When downloading our desired data sets, CO2 emissions per capita and life expectancy we received some warnings during our attempts to pivot our data sets. We came to learn that R was reading in the negative dashes as character variables as opposed to numeric, specifically in our CO2 emissions data.
To fix this issue, our group utilized str_replace_all (string replace) in order to replace the character variable with negative signs. This process was only applied for the CO2 emissions data.
We applied as.numeric for both data sets, and lastly pivoted long each data to receive our variables country, year and the data collected per data set.
In our model we will analyze CO2 emissions as the explanatory variable and life expectancy as the response variable. We expect a negative relationship between CO2 emissions and life expectancy, where countries that produce greater amounts of CO2 tend to have lower life expectancy. We assume CO2 emissions decrease air quality and cause respiratory issues, decreasing life expectancy.
# Visualization 1:
animated <- project_data |>
ggplot(aes(x = co2,
y = life_exp)) +
geom_point(alpha = 0.7, show.legend = FALSE,
na.rm = TRUE) +
labs(title = "CO2 Emissions vs. Life Expectancy per Year: {round(frame_time)}",
x = "CO2 Emissions (tonnes per capita)",
y = "",
subtitle = "Life Expectancy (years)") +
transition_time(year) +
ease_aes("linear") +
theme_bw()
animate(animated, renderer = gifski_renderer(), fps = 7)Figure 1 depicts the relationship between the explanatory variable of carbon dioxide emissions per capita in tonnes and the response variable of life expectancy in years over years 1800 to 2022. It appears that with each year, both CO2 emissions and life expectancy increase. They likely both increase over time due to technology advancements. While the effects of CO2 emissions can create an impact on an individuals health, society still progresses in fields such as health care.
# Visualization 2:
project_data |>
group_by(country) |>
summarize(avg_co2 = mean(co2),
avg_life_exp = mean(life_exp)) |>
ggplot(aes(x = avg_co2,
y = avg_life_exp)) +
geom_jitter(na.rm = TRUE) +
geom_smooth(method = "lm", na.rm = TRUE) +
theme_bw() +
labs(title = "Averge CO2 Emissions and Life Expectancy per Country",
x = "Average CO2 Emissions (tonnes per capita)",
y = NULL,
subtitle = "Average Life Expectancy (years)")Figure 2 depicts the average carbon dioxide emissions per capita in tonnes per country and the average life expectancy per country over the years 1800 to 2022. It appears that there is positive weak to moderate linear relationship between the two variable. In addition to the apparent increase in life expectancy as CO2 emissions increase, there also appears to be a curved pattern as emissions increase, there is a steep slope, but it plateaus as emissions rise.
We used Simple Linear Regression (SLR) to fit a linear model between our variables. By using SLR, we assume that there is a linear relationship between average co2 emissions and average life expectancy per country, and we can use that to estimate a country’s average life expectancy (response variable) based on their average co2 emissions (explanatory variable). This function will be of the following form: response = intercept + slope * explanatory. Therefore, we can predict the average life expectancy of a country by multiplying their average co2 emissions by the model slope and adding the model intercept.
# Model:
project_lm <- project_data |>
group_by(country) |>
summarize(avg_co2 = mean(co2),
avg_life_exp = mean(life_exp))
# Coefficients:
project_lm <- lm(avg_life_exp ~ avg_co2,
data = project_lm)
broom::tidy(project_lm) |>
knitr::kable(digits = 3)| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 41.149 | 0.404 | 101.908 | 0 |
| avg_co2 | 1.806 | 0.155 | 11.641 | 0 |
\(\widehat{y} = 41.149 + 1.806x\)
Our response variable is life expectancy and our explanatory is CO2 emissions. Our simple linear model provided us with the information that the average life expectancy is 41.149 years when carbon dioxide emissions are zero. The slope term in our model is stating that for every additional one tonne per capita of CO2 emissions, the estimated life expectancy goes up by 1.806 years, on average.
project_lm |>
broom::augment() |>
ggplot(aes(x = .fitted, y = .resid)) +
geom_point() +
labs(x = "Fitted Values",
y = "Residuals") +
geom_abline(slope = 0,
intercept = 0,
color = "steelblue",
linetype = "dashed",
lwd = 1.5) +
theme_bw()augment(project_lm) |>
summarize(var_response = var(avg_life_exp),
var_fit = var(.fitted),
var_resid = var(.resid)) |>
mutate(var_prop = var_fit/var_response) |>
kable(digits = 3) |>
kable_styling(latex_options = "basic",
position = "center",
row_label_position = "c")| var_response | var_fit | var_resid | var_prop |
|---|---|---|---|
| 33.054 | 14.018 | 19.036 | 0.424 |
The table shows 42.41% of variability in life expectancy is explained by our linear regression model using CO2 emissions as a predictor (\(14.02/33.04 = 0.4241\)). Accounting for not even half of the variability, this does not appear to be a particularly strong model of life expectancy.
simulation <- as.vector(predict(project_lm) + rnorm(n = 186, mean = 0, sd = sigma(project_lm)))
sim_data <- project_data |>
group_by(country) |>
summarize(avg_life_exp = mean(life_exp)) |>
filter(!is.na(avg_life_exp)) |>
mutate(sim = simulation)
sim_data |>
ggplot(mapping = aes(x = sim, y = avg_life_exp)) +
geom_point() +
labs(
title = "Observed vs. Simulated Average Life Expectancy",
x = "Simulated Average Life Expectancy per Country (Years)",
y = "",
subtitle = "Observed Average Life Expectancy per Country (Years)"
) +
geom_smooth(method = "lm") +
theme_bw()nsims <- 100
sims <- map_dfc(
.x = 1:nsims,
.f = ~ tibble(sim = as.vector(predict(project_lm) + rnorm(
n = 186,
mean = 0,
sd = sigma(project_lm)
)))
)
colnames(sims) <- colnames(sims) |>
str_replace(
pattern = "\\.\\.\\.",
replace = "_"
)
sims <- project_data |>
group_by(country) |>
summarize(avg_life_exp = mean(life_exp)) |>
filter(!is.na(avg_life_exp)) |>
select(avg_life_exp) |>
bind_cols(sims)
sims_r <- sims |>
map(~ lm(avg_life_exp ~ .x,
data = sims
)) |>
map(glance) |>
map_dbl(~ .x$r.squared)
sims_r <- sims_r[names(sims_r) != "avg_life_exp"]
tibble(sims = sims_r) |>
ggplot(aes(x = sims)) +
geom_histogram(binwidth = 0.025) +
labs(
x = expression("Simulated" ~ R^2),
y = "",
subtitle = "Number of Simulated Models"
) +
theme_bw()lm(avg_life_exp ~ sim, data = sim_data) |>
glance() |>
select(r.squared) |>
pull()[1] 0.1820868
=======
[1] 0.2021724
>>>>>>> abb11a93051eca34f7d1e727994520fe764e181f
nsims <- 100
sims <- map_dfc(
.x = 1:nsims,
.f = ~ tibble(sim = as.vector(predict(project_lm) + rnorm(
n = 186,
mean = 0,
sd = sigma(project_lm)
)))
)
colnames(sims) <- colnames(sims) |>
str_replace(
pattern = "\\.\\.\\.",
replace = "_"
)
sims <- project_data |>
group_by(country) |>
summarize(avg_life_exp = mean(life_exp)) |>
filter(!is.na(avg_life_exp)) |>
select(avg_life_exp) |>
bind_cols(sims)
sims_r <- sims |>
map(~ lm(avg_life_exp ~ .x,
data = sims
)) |>
map(glance) |>
map_dbl(~ .x$r.squared)
sims_r <- sims_r[names(sims_r) != "avg_life_exp"]
tibble(sims = sims_r) |>
ggplot(aes(x = sims)) +
geom_histogram(binwidth = 0.025) +
labs(
x = expression("Simulated" ~ R^2),
y = "",
subtitle = "Number of Simulated Models"
) +
theme_bw()head(sims)# A tibble: 6 × 101
avg_life_exp sim_1 sim_2 sim_3 sim_4 sim_5 sim_6 sim_7 sim_8 sim_9 sim_10
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
<<<<<<< HEAD
1 38.0 41.8 40.9 48.3 26.8 35.6 43.6 36.4 39.8 46.0 45.1
2 47.2 40.3 42.2 44.1 47.2 48.4 38.5 35.6 41.1 45.0 49.7
3 42.1 39.5 40.9 50.3 42.6 35.3 46.6 45.2 41.2 40.8 42.6
4 37.7 39.4 42.0 46.8 33.2 38.1 48.7 43.2 41.1 32.6 47.5
5 47.3 49.3 48.0 43.7 41.0 38.2 36.5 42.9 44.9 48.0 44.3
6 49.7 38.2 40.5 42.0 44.8 50.9 44.4 41.2 46.2 52.2 36.3
=======
1 38.0 43.3 52.6 40.6 41.8 33.8 46.1 43.0 49.2 46.2 51.4
2 47.2 44.0 47.2 45.8 33.5 43.9 42.6 48.2 46.3 41.4 35.3
3 42.1 39.6 37.5 42.1 46.1 39.7 43.4 47.8 40.3 45.3 43.6
4 37.7 41.7 35.3 48.8 46.5 43.4 35.1 44.0 40.0 45.0 34.0
5 47.3 43.9 43.1 37.0 40.1 43.8 41.3 41.8 43.0 43.0 47.3
6 49.7 40.7 39.6 41.5 45.3 45.0 48.5 44.0 46.1 42.9 45.3
>>>>>>> abb11a93051eca34f7d1e727994520fe764e181f
# ℹ 90 more variables: sim_11 <dbl>, sim_12 <dbl>, sim_13 <dbl>, sim_14 <dbl>,
# sim_15 <dbl>, sim_16 <dbl>, sim_17 <dbl>, sim_18 <dbl>, sim_19 <dbl>,
# sim_20 <dbl>, sim_21 <dbl>, sim_22 <dbl>, sim_23 <dbl>, sim_24 <dbl>,
# sim_25 <dbl>, sim_26 <dbl>, sim_27 <dbl>, sim_28 <dbl>, sim_29 <dbl>,
# sim_30 <dbl>, sim_31 <dbl>, sim_32 <dbl>, sim_33 <dbl>, sim_34 <dbl>,
# sim_35 <dbl>, sim_36 <dbl>, sim_37 <dbl>, sim_38 <dbl>, sim_39 <dbl>,
# sim_40 <dbl>, sim_41 <dbl>, sim_42 <dbl>, sim_43 <dbl>, sim_44 <dbl>, …
head(sims_r) sim_1 sim_2 sim_3 sim_4 sim_5 sim_6
<<<<<<< HEAD
0.1030292 0.1362944 0.1336552 0.1557803 0.1696784 0.2004469
=======
0.2115056 0.1996862 0.1783148 0.1608613 0.2268912 0.1378525
>>>>>>> abb11a93051eca34f7d1e727994520fe764e181f
>>>>>>> f466e7b8593a64d307e28c92c69e6c7eedd939da